Predictive Inference Is Free with the Jackknife+-after-Bootstrap

Neural Information Processing Systems

Ensemble learning is a popular technique for enhancing the performance of machine learning algorithms. It is used to capture a complex model space with simple hypotheses which are often significantly easier to learn, or to increase the accuracy of an otherwise unstable procedure [see 14, 27, 29, and references therein].


Machine learning to optimize precision in the analysis of randomized trials: A journey in pre-specified, yet data-adaptive learning

arXiv.org Machine Learning

Covariate adjustment is an approach to improve the precision of trial analyses by adjusting for baseline variables that are prognostic of the primary endpoint. Motivated by the SEARCH Universal HIV Test-and-Treat Trial (2013-2017), we tell our story of developing, evaluating, and implementing a machine learning-based approach for covariate adjustment. We provide the rationale for as well as the practical concerns with such an approach for estimating marginal effects. Using schematics, we illustrate our procedure: targeted machine learning estimation (TMLE) with Adaptive Pre-specification. Briefly, sample-splitting is used to data-adaptively select the combination of estimators of the outcome regression (i.e., the conditional expectation of the outcome given the trial arm and covariates) and known propensity score (i.e., the conditional probability of being randomized to the intervention given the covariates) that minimizes the cross-validated variance estimate and, thereby, maximizes empirical efficiency. We discuss our approach for evaluating finite sample performance with parametric and plasmode simulations, pre-specifying the Statistical Analysis Plan, and unblinding in real-time on video conference with our colleagues from around the world. We present the results from applying our approach in the primary, pre-specified analysis of 8 recently published trials (2022-2024). We conclude with practical recommendations and an invitation to implement our approach in the primary analysis of your next trial.
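
The selection step described in this abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes a two-arm trial with a known randomization probability of 0.5, considers only candidate outcome regressions (unadjusted, or a per-arm linear fit on a single covariate), and holds the propensity score fixed rather than also selecting it. The function names, the fold scheme, and the simulated data are all hypothetical.

```python
import random
from statistics import mean, pvariance

def fit_arm_models(rows, cov):
    """Fit a per-arm working model: arm mean (cov=None) or simple linear fit on W[cov]."""
    models = {}
    for arm in (0, 1):
        ys = [y for a, y, w in rows if a == arm]
        if cov is None:
            models[arm] = (mean(ys), 0.0)
            continue
        xs = [w[cov] for a, y, w in rows if a == arm]
        mx, my = mean(xs), mean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        models[arm] = (my - slope * mx, slope)
    return models

def predict(models, arm, w, cov):
    intercept, slope = models[arm]
    return intercept + (slope * w[cov] if cov is not None else 0.0)

def cv_variance(rows, cov, n_folds=5, g=0.5):
    """Cross-validated variance of the influence-curve terms for the marginal effect."""
    fold_variances = []
    for k in range(n_folds):
        train = [r for i, r in enumerate(rows) if i % n_folds != k]
        valid = [r for i, r in enumerate(rows) if i % n_folds == k]
        models = fit_arm_models(train, cov)  # fit on training folds only
        terms = []
        for a, y, w in valid:
            q1 = predict(models, 1, w, cov)
            q0 = predict(models, 0, w, cov)
            qa = q1 if a == 1 else q0
            weight = a / g - (1 - a) / (1 - g)  # known propensity score g
            terms.append(weight * (y - qa) + q1 - q0)
        fold_variances.append(pvariance(terms))
    return mean(fold_variances)

def select_adjustment(rows, candidates):
    """Return the candidate with the smallest cross-validated variance estimate."""
    return min(candidates, key=lambda cov: cv_variance(rows, cov))

def simulate_trial(n=400, seed=0):
    """Hypothetical data: covariate 0 is prognostic, covariate 1 is pure noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = 1 if rng.random() < 0.5 else 0
        w = (rng.gauss(0, 1), rng.gauss(0, 1))
        y = 0.5 * a + 2.0 * w[0] + rng.gauss(0, 0.5)
        rows.append((a, y, w))
    return rows
```

Calling `select_adjustment(simulate_trial(), [None, 0, 1])` compares the unadjusted estimator against adjustment for each covariate. Because the candidate set is fixed in advance and scored only on held-out folds, the choice is data-adaptive while remaining pre-specified.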


PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units

arXiv.org Artificial Intelligence

Intensive care unit (ICU) data are highly irregular, heterogeneous, and temporally fragmented, posing challenges for generalizable clinical prediction. We present PULSE-ICU, a self-supervised foundation model that learns event-level ICU representations from large-scale EHR sequences without resampling or manual feature engineering. A unified embedding module encodes event identity, continuous values, units, and temporal attributes, while a Longformer-based encoder enables efficient modeling of long trajectories. PULSE-ICU was fine-tuned across 18 prediction tasks, including mortality, intervention forecasting, and phenotype identification, achieving strong performance across task types. External validation on eICU, HiRID, and P12 showed substantial improvements with minimal fine-tuning, demonstrating robustness to domain shift and variable constraints. These findings suggest that foundation-style modeling can improve data efficiency and adaptability, providing a scalable framework for ICU decision support across diverse clinical environments.



Semiparametric Learning from Open-Set Label Shift Data

arXiv.org Machine Learning

We study the open-set label shift problem, where the test data may include a novel class absent from training. This setting is challenging because both the class proportions and the distribution of the novel class are not identifiable without extra assumptions. Existing approaches often rely on restrictive separability conditions, prior knowledge, or computationally infeasible procedures, and some may lack theoretical guarantees. We propose a semiparametric density ratio model framework that ensures identifiability while allowing overlap between novel and known classes. Within this framework, we develop maximum empirical likelihood estimators and confidence intervals for class proportions, establish their asymptotic validity, and design a stable Expectation-Maximization algorithm for computation. We further construct an approximately optimal classifier based on posterior probabilities with theoretical guarantees. Simulations and a real data application confirm that our methods improve both estimation accuracy and classification performance compared with existing approaches.


Regularizing Log-Linear Cost Models for Inpatient Stays by Merging ICD-10 Codes

arXiv.org Machine Learning

Cost models in healthcare research must balance interpretability, accuracy, and parameter consistency. However, interpretable models often struggle to achieve both accuracy and consistency. Ordinary least squares (OLS) models for high-dimensional regression can be accurate but fail to produce stable regression coefficients over time when using highly granular ICD-10 diagnostic codes as predictors. This instability arises because many ICD-10 codes are infrequent in healthcare datasets. While regularization methods such as Ridge can address this issue, they risk discarding important predictors. Here, we demonstrate that reducing the granularity of ICD-10 codes is an effective regularization strategy within OLS while preserving the representation of all diagnostic code categories. By truncating ICD-10 codes from seven characters (e.g., T67.0XXA, T67.0XXD) to six (e.g., T67.0XX) or fewer, we reduce the dimensionality of the regression problem while maintaining model interpretability and consistency. Mathematically, the merging of predictors in OLS leads to increased trace of the Hessian matrix, which reduces the variance of coefficient estimation. Our findings explain why broader diagnostic groupings like DRGs and HCC codes are favored over highly granular ICD-10 codes in real-world risk adjustment and cost models.
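
As a rough illustration of the truncation step described above, the sketch below shortens ICD-10 codes to a given number of characters (not counting the dot) and counts how many records fall under each merged code. The helper names and the sample codes are hypothetical and are not taken from the paper.

```python
from collections import Counter

def truncate_icd10(code: str, n_chars: int) -> str:
    """Keep the first n_chars characters of an ICD-10 code, not counting the dot."""
    kept = []
    count = 0
    for ch in code:
        if ch == ".":
            kept.append(ch)  # the dot is a separator, not a code character
            continue
        if count == n_chars:
            break
        kept.append(ch)
        count += 1
    # Drop a trailing dot left behind when truncating to the category level.
    return "".join(kept).rstrip(".")

def merge_codes(codes, n_chars: int) -> Counter:
    """Count records per truncated code; each distinct key becomes one predictor."""
    return Counter(truncate_icd10(code, n_chars) for code in codes)

# Hypothetical sample of diagnosis codes from inpatient records.
sample = ["T67.0XXA", "T67.0XXD", "T67.0XXA", "J18.9"]
merged = merge_codes(sample, 6)
```

Here the two seven-character codes collapse into `T67.0XX`, so the sample goes from three distinct predictors to two, and the merged predictor is supported by more observations than either original code.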


A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease

arXiv.org Machine Learning

Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children's Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.


State and Memory is All You Need for Robust and Reliable AI Agents

arXiv.org Artificial Intelligence

Large language models (LLMs) have enabled powerful advances in natural language understanding and generation. Yet their application to complex, real-world scientific workflows remains limited by challenges in memory, planning, and tool integration. Here, we introduce SciBORG (Scientific Bespoke Artificial Intelligence Agents Optimized for Research Goals), a modular agentic framework that allows LLM-based agents to autonomously plan, reason, and achieve robust and reliable domain-specific task execution. Agents are constructed dynamically from source code documentation and augmented with finite-state automata (FSA) memory, enabling persistent state tracking and context-aware decision-making. This approach eliminates the need for manual prompt engineering and allows for robust, scalable deployment across diverse applications by maintaining context across extended workflows and recovering from tool or execution failures. We validate SciBORG through integration with both physical and virtual hardware, such as microwave synthesizers for executing user-specified reactions with context-aware decision-making, and demonstrate its use in autonomous multi-step bioassay retrieval from the PubChem database, utilizing multi-step planning, reasoning, and agent-to-agent communication and coordination for the execution of exploratory tasks. Systematic benchmarking shows that SciBORG agents achieve reliable execution, adaptive planning, and interpretable state transitions. Our results show that memory and state awareness are critical enablers of agentic planning and reliability, offering a generalizable foundation for deploying AI agents in complex environments.
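
The idea of finite-state automaton memory can be sketched in a few lines. This is a generic illustration, not SciBORG's API: the class, the state names, and the action names for a microwave synthesizer are all hypothetical.

```python
class FSAMemory:
    """Tracks the current state of a tool and rejects actions that are invalid in it."""

    def __init__(self, transitions, start):
        self.transitions = transitions  # maps (state, action) -> next state
        self.state = start
        self.history = [start]          # persistent record of visited states

    def allowed_actions(self):
        """Actions the agent may take from the current state."""
        return sorted(a for (s, a) in self.transitions if s == self.state)

    def apply(self, action):
        """Advance the state if the action is valid; otherwise leave it unchanged."""
        key = (self.state, action)
        if key not in self.transitions:
            return False
        self.state = self.transitions[key]
        self.history.append(self.state)
        return True

# Hypothetical transition table for a microwave synthesizer.
SYNTH_TRANSITIONS = {
    ("idle", "load_vial"): "loaded",
    ("loaded", "start_heating"): "heating",
    ("heating", "finish"): "done",
    ("heating", "abort"): "idle",
    ("done", "unload_vial"): "idle",
}
```

An agent loop could consult `allowed_actions()` before each tool call, so that a request such as `start_heating` is refused while the instrument is still `idle`. The `abort` transition shows one way a failure could return the automaton to a known state.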


The Longitudinal Health, Income, and Employment Model (LHIEM): a discrete-time microsimulation model for policy analysis

arXiv.org Machine Learning

Dynamic microsimulation has long been recognized as a powerful tool for policy analysis, but in fact most major health policy simulations lack path dependency, a critical feature for evaluating policies that depend on accumulated outcomes such as retirement savings, wealth, or debt. We propose the Longitudinal Health, Income and Employment Model (LHIEM), a path-dependent discrete-time microsimulation that predicts annual health care expenditures, family income, and health status for the U.S. population over a multi-year period. LHIEM advances the population from year to year as a Markov chain with modules capturing the particular dynamics of each predictive attribute. LHIEM was designed to assess a health care financing proposal that would allow individuals to borrow from the U.S. government to cover health care costs, requiring careful tracking of medical expenditures and medical debt over time. However, LHIEM is flexible enough to be used for a range of modeling needs related to predicting health care spending and income over time. In this paper, we present the details of the model and all dynamic modules, and include a case study to demonstrate how LHIEM can be used to evaluate proposed policy changes.
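
A toy version of a path-dependent, discrete-time microsimulation is sketched below. It is not LHIEM: the three health states, the transition probabilities, and the mean annual costs are invented for illustration, and expenditures are drawn from an exponential distribution purely for simplicity.

```python
import random

STATES = ["good", "fair", "poor"]

# Hypothetical annual transition probabilities between health states.
TRANSITION = {
    "good": [0.85, 0.12, 0.03],
    "fair": [0.20, 0.65, 0.15],
    "poor": [0.05, 0.25, 0.70],
}

# Hypothetical mean annual health care expenditure per state, in dollars.
MEAN_COST = {"good": 2000.0, "fair": 6000.0, "poor": 20000.0}

def simulate_person(n_years, rng, start="good"):
    """Advance one person year by year, tracking cumulative expenditures."""
    state = start
    cumulative = 0.0
    path = []
    for year in range(n_years):
        if year > 0:
            # Markov step: next state depends only on the current state.
            state = rng.choices(STATES, weights=TRANSITION[state])[0]
        cost = rng.expovariate(1.0 / MEAN_COST[state])
        cumulative += cost  # accumulated outcome carried along the path
        path.append((year, state, cost, cumulative))
    return path

def simulate_population(n_people, n_years, seed=0):
    """Simulate independent individuals with a shared, seeded random generator."""
    rng = random.Random(seed)
    return [simulate_person(n_years, rng) for _ in range(n_people)]
```

The running `cumulative` total is what makes the sketch path-dependent: a policy rule such as a borrowing limit could be evaluated against each person's accumulated spending rather than against a single year in isolation.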